Abstract

Despite the existence of several noun phrase coreference resolution
data sets as well as several formal evaluations on the task, it remains
frustratingly difficult to compare results across different coreference
resolution systems. This is due to the high cost of implementing a
complete end-to-end coreference resolution system, which often forces
researchers to substitute available gold-standard information in lieu
of implementing a module that would compute that information.
Unfortunately, this leads to inconsistent and often unrealistic
evaluation scenarios.

With the aim of facilitating consistent and realistic experimental
evaluations in coreference resolution, we present Reconcile, an
infrastructure for the development of learning-based noun phrase (NP)
coreference resolution systems. Reconcile is designed to facilitate the
rapid creation of coreference resolution systems, easy implementation
of new feature sets and approaches to coreference resolution, and
empirical evaluation of coreference resolvers across a variety of
benchmark data sets and standard scoring metrics. We describe Reconcile
and present experimental results showing that Reconcile can be used to
create a coreference resolver that achieves performance comparable to
state-of-the-art systems on six benchmark data sets.

1  Introduction 

Noun phrase coreference resolution (or simply coreference resolution)
is the problem of identifying all noun phrases (NPs) that refer to the
same entity in a text. The problem of coreference resolution is
fundamental in the field of natural language processing (NLP) because
of its usefulness for other NLP tasks, as well as the theoretical
interest in understanding the computational mechanisms involved in
government, binding and linguistic reference.

Several formal evaluations have been conducted for the coreference
resolution task (e.g., MUC-6 (1995), ACE NIST (2004)), and the data
sets created for these evaluations have become standard benchmarks in
the field (e.g., MUC and ACE data sets). However, it is still
frustratingly difficult to compare results across different coreference
resolution systems. Reported coreference resolution scores vary wildly
across data sets, evaluation metrics, and system configurations.

We believe that one root cause of these disparities is the high cost of
implementing an end-to-end coreference resolution system. Coreference
resolution is a complex problem, and successful systems must tackle a
variety of non-trivial subproblems that are central to the coreference
task (e.g., mention/markable detection and anaphor identification) and
that require substantial implementation efforts. As a result, many
researchers exploit gold-standard annotations, when available, as a
substitute for component technologies to solve these subproblems. For
example, many published research results use gold-standard annotations
to identify NPs (substituting for mention/markable detection), to
distinguish anaphoric NPs from non-anaphoric NPs (substituting for
anaphoricity determination), to identify named entities (substituting
for named entity recognition), and to identify the semantic types of
NPs (substituting for semantic class identification). Unfortunately,
the use of gold-standard annotations for key component technologies
leads to an unrealistic evaluation setting, and makes it impossible to
directly compare results against coreference resolvers that solve all
of these subproblems from scratch.

Comparison of coreference resolvers is further hindered by the use of
several competing (and non-trivial) evaluation measures, and data sets
that have substantially different task definitions and annotation
formats. Additionally, coreference resolution is a pervasive problem in
NLP and many NLP applications could benefit from an effective
coreference resolver that can be easily configured and customized.

To address these issues, we have created a platform for coreference
resolution, called Reconcile, that can serve as a software
infrastructure to support the creation of, experimentation with, and
evaluation of coreference resolvers. Reconcile was designed with the
following seven desiderata in mind:

•  implement the basic underlying software architecture of
contemporary state-of-the-art learning-based coreference resolution
systems;

•  support experimentation on most of the standard coreference
resolution data sets;

•  implement the most popular coreference resolution scoring metrics;

•  exhibit state-of-the-art coreference resolution performance (i.e.,
it can be configured to create a resolver that achieves performance
close to the best reported results);

•  be easily extensible with new methods and features;

•  be relatively fast and easy to configure and run;

•  include a set of pre-built resolvers that can be used as black-box
coreference resolution systems.

While several other coreference resolution systems are publicly
available (e.g., Poesio and Kabadjov (2004), Qiu et al. (2004) and
Versley et al. (2008)), none meets all seven of these desiderata (see
Related Work). Reconcile is a modular software platform that abstracts
the basic architecture of most contemporary supervised learning-based
coreference resolution systems (e.g., Soon et al. (2001), Ng and Cardie
(2002), Bengtson and Roth (2008)) and achieves performance comparable
to the state-of-the-art on several benchmark data sets. Additionally,
Reconcile can be easily reconfigured to use different algorithms,
features, preprocessing elements, evaluation settings and metrics.

In the rest of this paper, we review related work (Section 2), describe
Reconcile’s organization and components (Section 3) and show
experimental results for Reconcile on six data sets and two evaluation
metrics (Section 4).

2  Related  Work 

Several coreference resolution systems are currently publicly
available. JavaRap (Qiu et al., 2004) is an implementation of the
Lappin and Leass (1994) Resolution of Anaphora Procedure (RAP). JavaRap
resolves only pronouns and, thus, it is not directly comparable to
Reconcile. GuiTaR (Poesio and Kabadjov, 2004) and BART (Versley et al.,
2008) (which can be considered a successor of GuiTaR) are both modular
systems that target the full coreference resolution task. As such, both
systems come close to meeting the majority of the desiderata set forth
in Section 1. BART, in particular, can be considered an alternative to
Reconcile, although we believe that Reconcile’s approach is more
flexible than BART’s. In addition, the architecture and system
components of Reconcile (including a comprehensive set of features that
draw on the expertise of state-of-the-art supervised learning
approaches, such as Bengtson and Roth (2008)) result in performance
closer to the state-of-the-art.

Coreference resolution has received much research attention, resulting
in an array of approaches, algorithms and features. Reconcile is
modeled after typical supervised learning approaches to coreference
resolution (e.g., the architecture introduced by Soon et al. (2001))
because of the popularity and relatively good performance of these
systems.

However, there have been other approaches to coreference resolution,
including unsupervised and semi-supervised approaches (e.g., Haghighi
and Klein (2007)), structured approaches (e.g., McCallum and Wellner
(2004) and Finley and Joachims (2005)), competition approaches (e.g.,
Yang et al. (2003)) and a bell-tree search approach (Luo et al.
(2004)). Most of these approaches rely on some notion of pairwise
feature-based similarity and can be directly implemented in Reconcile.

3  System  Description 

Reconcile was designed to be a research testbed capable of implementing
most current approaches to coreference resolution. Reconcile is written
in Java, to be portable across platforms, and was designed to be easily
reconfigurable with respect to subcomponents, feature sets, parameter
settings, etc.

Reconcile’s architecture is illustrated in Figure 1. For simplicity,
Figure 1 shows Reconcile’s operation during the classification phase
(i.e., assuming that a trained classifier is present).

Figure 1: The Reconcile classification architecture.

The basic architecture of the system includes five major steps.
Starting with a corpus of documents together with a manually annotated
coreference resolution answer key (required only during training),
Reconcile performs the following steps, in order:

1. Preprocessing. All documents are passed through a series of
(external) linguistic processors such as tokenizers, part-of-speech
taggers, syntactic parsers, etc. These components produce annotations
of the text. Table 1 lists the preprocessors currently interfaced in
Reconcile. Note that Reconcile includes several in-house NP detectors
that conform to the different data sets’ definitions of what
constitutes an NP (e.g., MUC vs. ACE). All of the extractors utilize a
syntactic parse of the text and the output of a Named Entity (NE)
extractor, but extract different constructs as specified in the
corresponding definition. The NP extractors successfully recognize
about 95% of the NPs in the MUC and ACE gold standards.

2. Feature generation. Using annotations produced during preprocessing,
Reconcile produces feature vectors for pairs of NPs. For example, a
feature might denote whether the two NPs agree in number, or whether
they have any words in common. Reconcile includes over 80 features,
inspired by other successful coreference resolution systems such as
Soon et al. (2001) and Ng and Cardie (2002).

3. Classification. Reconcile learns a classifier that operates on
feature vectors representing pairs of NPs and it is trained to assign a
score indicating the likelihood that the NPs in the pair are
coreferent.

4. Clustering. A clustering algorithm consolidates the predictions
output by the classifier and forms the final set of coreference
clusters (chains).

5. Scoring. Finally, during testing Reconcile runs scoring algorithms
that compare the chains produced by the system to the gold-standard
chains in the answer key.

Task               Systems
-----------------  ----------------------------------
Sentence splitter  UIUC (CC Group, 2009)
                   OpenNLP (Baldridge, J., 2005)
Tokenizer          OpenNLP (Baldridge, J., 2005)
POS Tagger         OpenNLP (Baldridge, J., 2005)
                   + the two parsers below
Parser             Stanford (Klein and Manning, 2003)
                   Berkeley (Petrov and Klein, 2007)
Dep. parser        Stanford (Klein and Manning, 2003)
NE Recognizer      OpenNLP (Baldridge, J., 2005)
                   Stanford (Finkel et al., 2005)
NP Detector        In-house

Table 1: Preprocessing components available in Reconcile.
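
As a rough illustration of the feature generation step described above,
the following Python sketch computes a few pairwise features of the kind
mentioned there (number agreement, shared words). It is not Reconcile
code: Reconcile itself is written in Java, and the NounPhrase record, the
pair_features function and the feature names are hypothetical stand-ins
chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class NounPhrase:
    """Hypothetical minimal NP record; real annotations would be richer."""
    text: str
    number: str  # "singular", "plural", or "unknown"


def pair_features(antecedent: NounPhrase, anaphor: NounPhrase) -> Dict[str, float]:
    """Build a small feature vector for one candidate NP pair."""
    a_words = set(antecedent.text.lower().split())
    b_words = set(anaphor.text.lower().split())
    # Number agreement only fires when both numbers are actually known.
    known = "unknown" not in (antecedent.number, anaphor.number)
    return {
        "number_agree": float(known and antecedent.number == anaphor.number),
        "word_overlap": float(bool(a_words & b_words)),
        "exact_match": float(antecedent.text.lower() == anaphor.text.lower()),
    }


if __name__ == "__main__":
    np1 = NounPhrase("the company", "singular")
    np2 = NounPhrase("The company", "singular")
    print(pair_features(np1, np2))
```

A classifier would then be trained on such vectors, one per candidate
pair, with the label taken from the answer key.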

Each of the five steps above can invoke different components.
Reconcile’s modularity makes it easy for new components to be
implemented and existing ones to be removed or replaced. Reconcile’s
standard distribution comes with a comprehensive set of implemented
components; those available for steps 2-5 are shown in Table 2.
Reconcile contains over 38,000 lines of original Java code. Only about
15% of the code is concerned with running existing components in the
preprocessing step, while the rest deals with NP extraction,
implementations of features, clustering algorithms and scorers. More
details about Reconcile’s architecture and available components and
features can be found in Stoyanov et al. (2010).

Note on the clustering step: some structured coreference resolution
algorithms (e.g., McCallum and Wellner (2004) and Finley and Joachims
(2005)) combine the classification and clustering steps above.
Reconcile can easily accommodate this modification.

Step            Available modules
--------------  ------------------------------------
Classification  various learners in the Weka toolkit
                libSVM (Chang and Lin, 2001)
                SVMlight (Joachims, 2002)
Clustering      Single-link
                Best-First
                Most Recent First
Scoring         MUC score (Vilain et al., 1995)
                B3 score (Bagga and Baldwin, 1998)
                CEAF score (Luo, 2005)

Table 2: Available implementations for different modules available in
Reconcile.
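
To make the clustering step concrete, here is a minimal Python sketch of
single-link clustering over pairwise classifier scores. It assumes
mentions are numbered 0..n-1 and that any pair scoring above a threshold
is linked, with chains formed by transitive closure. The function name,
the input format and the use of union-find are assumptions of this
sketch, not a description of Reconcile's Java implementation.

```python
from typing import Dict, List, Tuple


def single_link(n_mentions: int,
                pair_scores: Dict[Tuple[int, int], float],
                threshold: float) -> List[List[int]]:
    """Merge every pair whose score exceeds the threshold (union-find)."""
    parent = list(range(n_mentions))

    def find(i: int) -> int:
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), score in pair_scores.items():
        if score > threshold:
            parent[find(i)] = find(j)

    clusters: Dict[int, List[int]] = {}
    for m in range(n_mentions):
        clusters.setdefault(find(m), []).append(m)
    return sorted(clusters.values())


if __name__ == "__main__":
    scores = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.1, (2, 3): 0.2}
    print(single_link(4, scores, threshold=0.5))  # [[0, 1, 2], [3]]
```

Note that mentions 0 and 2 end up in the same chain even though their own
pair score is low, which is the defining behaviour of single-link
clustering.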

4  Evaluation 

4.1  Data  Sets 

Reconcile incorporates the six most commonly used coreference
resolution data sets, two from the MUC conferences (MUC-6, 1995; MUC-7,
1997) and four from the ACE Program (NIST, 2004). For ACE, we
incorporate only the newswire portion. When available, Reconcile
employs the standard test/train split. Otherwise, we randomly split the
data into a training and test set following a 70/30 ratio. Performance
is evaluated according to the B3 and MUC scoring metrics.
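
For readers unfamiliar with B3 (Bagga and Baldwin, 1998), the sketch
below computes it in Python under a simplifying assumption: the key and
the response partition exactly the same set of mentions. Handling
system-detected mentions that do not appear in the key requires
additional conventions that are not shown here. The function name and
data layout are illustrative and are not Reconcile's API.

```python
from typing import List, Set, Tuple


def b_cubed(key: List[Set[str]],
            response: List[Set[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) averaged over individual mentions."""
    key_of = {m: chain for chain in key for m in chain}
    resp_of = {m: chain for chain in response for m in chain}
    if not key_of or set(key_of) != set(resp_of):
        raise ValueError("this sketch assumes identical, non-empty mention sets")
    n = len(key_of)
    # Per mention: overlap of its key chain and response chain, normalised
    # by the response chain (precision) or by the key chain (recall).
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m]) for m in key_of) / n
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m]) for m in key_of) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    key = [{"a", "b", "c"}, {"d"}]
    response = [{"a", "b"}, {"c", "d"}]
    print(b_cubed(key, response))
```

On this toy input the precision is 0.75 and the recall is 2/3.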

4.2  The  Reconcile 2010  Configuration 

Reconcile can be easily configured with different algorithms for
markable detection, anaphoricity determination, feature extraction,
etc., and run against several scoring metrics. For the purpose of this
sample evaluation, we create only one particular instantiation of
Reconcile, which we will call Reconcile 2010 to differentiate it from
the general platform. Reconcile 2010 is configured using the following
components:

1. Preprocessing

(a) Sentence Splitter: OpenNLP

(b) Tokenizer: OpenNLP

(c) POS Tagger: OpenNLP

(d) Parser: Berkeley

(e) Named Entity Recognizer: Stanford

2. Feature Set - A hand-selected subset of 60 out of the more than 80
features available. The features were selected to include most of the
features from Soon et al. (2001), Ng and Cardie (2002) and Bengtson and
Roth (2008).

3. Classifier - Averaged Perceptron

4. Clustering - Single-link. The positive decision threshold was tuned
by cross-validation on the training set.
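
The averaged perceptron named in the configuration above can be sketched
in a few lines of Python. This is a generic textbook-style version over
dense feature vectors with labels +1 (coreferent) and -1 (not
coreferent). The function names, the epoch count and the toy data are
assumptions of this sketch and say nothing about how Reconcile's Java
classifier is implemented or tuned.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, label in {+1, -1})


def train_averaged_perceptron(examples: Sequence[Example],
                              epochs: int = 10) -> Tuple[List[float], float]:
    """Return weights and bias averaged over every training step."""
    dim = len(examples[0][0])
    w, b = [0.0] * dim, 0.0
    w_sum, b_sum, steps = [0.0] * dim, 0.0, 0
    for _ in range(epochs):
        for x, y in examples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # mistake: move toward the example
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
            # Accumulate after every example, updated or not.
            w_sum = [s + wi for s, wi in zip(w_sum, w)]
            b_sum += b
            steps += 1
    return [s / steps for s in w_sum], b_sum / steps


def pair_score(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Signed score; larger values mean 'more likely coreferent'."""
    return sum(wi * xi for wi, xi in zip(weights, x)) + bias


if __name__ == "__main__":
    # Toy data: a pair is coreferent only if both binary features fire.
    data = [([1.0, 1.0], 1), ([1.0, 0.0], -1), ([0.0, 1.0], -1), ([0.0, 0.0], -1)]
    weights, bias = train_averaged_perceptron(data, epochs=200)
    print([pair_score(weights, bias, x) > 0 for x, _ in data])
```

The real-valued score, rather than only its sign, is what a downstream
clustering threshold would be compared against.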

4.3  Experimental  Results 

The first two rows of Table 3 show the performance of Reconcile 2010.
For all data sets, B3 scores are higher than MUC scores. The MUC score
is highest for the MUC6 data set, while B3 scores are higher for the
ACE data sets as compared to the MUC data sets.
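
One well-known reason the two metrics can differ is that the MUC score
counts coreference links, so singleton entities contribute nothing,
whereas B3 is computed per mention. The Python sketch below implements
the link-based MUC computation of Vilain et al. (1995) as it is commonly
described. The names and the data layout are assumptions of this sketch
and do not reflect Reconcile's scorer.

```python
from typing import List, Set, Tuple


def _link_ratio(gold: List[Set[str]], system: List[Set[str]]) -> float:
    """Fraction of links in `gold` chains that `system` chains preserve."""
    cluster_of = {m: idx for idx, chain in enumerate(system) for m in chain}
    found = total = 0
    for chain in gold:
        # Partition the chain by system cluster; mentions the system never
        # clustered each count as their own part.
        parts = {cluster_of[m] for m in chain if m in cluster_of}
        missing = sum(1 for m in chain if m not in cluster_of)
        found += len(chain) - (len(parts) + missing)
        total += len(chain) - 1
    return found / total if total else 0.0


def muc(key: List[Set[str]], response: List[Set[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the MUC link-based metric."""
    recall = _link_ratio(key, response)
    precision = _link_ratio(response, key)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    key = [{"a", "b", "c"}, {"d"}]
    response = [{"a", "b"}, {"c", "d"}]
    print(muc(key, response))  # (0.5, 0.5, 0.5)
```

In this toy example the singleton key chain {"d"} adds nothing to the
recall denominator, which illustrates why MUC is insensitive to
singleton entities.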

Due to the difficulties outlined in Section 1, results for Reconcile
presented here are directly comparable only to a limited number of
scores reported in the literature. The bottom three rows of Table 3
list these comparable scores, which show that Reconcile 2010 exhibits
state-of-the-art performance for supervised learning-based coreference
resolvers. A more detailed study of Reconcile-based coreference
resolution systems in different evaluation scenarios can be found in
Stoyanov et al. (2009).

5  Conclusions 

Reconcile is a general architecture for coreference resolution that can
be used to easily create various coreference resolvers. Reconcile
provides broad support for experimentation in coreference resolution,
including implementation of the basic architecture of contemporary
state-of-the-art coreference systems and a variety of individual
modules employed in these systems. Additionally, Reconcile handles all
of the formatting and scoring peculiarities of the most widely used
coreference resolution data sets (those created as part of the MUC and
ACE conferences) and, thus, allows for easy implementation and
evaluation across these data sets. We hope that Reconcile will support
experimental research in coreference resolution and provide a
state-of-the-art coreference resolver for both researchers and
application developers. We believe that in this way Reconcile will
facilitate meaningful and consistent comparisons of coreference
resolution systems. The full Reconcile release is available for
download at http://www.cs.utah.edu/nlp/reconcile/.

                             Data sets
System                Score  MUC6   MUC7   ACE-2  ACE03  ACE04  ACE05
--------------------  -----  -----  -----  -----  -----  -----  -----
Reconcile 2010        MUC    68.50  62.80  65.99  67.87  62.03  67.41
                      B3     70.88  65.86  78.29  79.39  76.50  73.71
Soon et al. (2001)    MUC    62.6   60.4   -      -      -      -
Ng and Cardie (2002)  MUC    70.4   63.4   -      -      -      -
Yang et al. (2003)    MUC    71.3   60.2   -      -      -      -

Table 3: Scores for Reconcile on six data sets and scores for comparable
coreference systems.